
Everything about Linear Regression totally explained

Linear regression is a form of regression analysis in which data are modeled by a least squares function which is a linear combination of the model parameters and depends on one or more independent variables. In simple linear regression the model function represents a straight line. The results of data fitting are subject to statistical analysis.
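
As a minimal sketch of the simple (straight-line) case, the least squares slope and intercept can be computed directly from the data. The function name `fit_line` and the data values are made up for illustration; they are not taken from the article.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and of cross products of x and y.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Illustrative data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope = fit_line(xs, ys)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

The fitted line always passes through the point of means, which is why the intercept is computed from the two means once the slope is known.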

Definitions

The data consist of m values, $y_1, \ldots, y_m$, of the dependent variable (response variable) y, derived from observations. The dependent variable is subject to error. This error is assumed to be a random variable with a mean of zero. Systematic error (for example, mean ≠ 0) may be present, but its treatment is outside the scope of regression analysis. The independent variable (explanatory variable) x is assumed to be error-free. If this isn't so, modeling should be done using errors-in-variables model techniques. The independent variables are also called regressors, exogenous variables, input variables and predictor variables.

In general there are n parameters to be determined, $\beta_1, \ldots, \beta_n$. The model is a linear combination of these parameters: $y_i = \sum$ […] $= 2.2$. Therefore, we can say that the 95% confidence intervals are:

$\beta_0 \in [92.9, 164.7]$

$\beta_1 \in [-186.8, -99.5]$

$\beta_2 \in [48.7, 75.2]$
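
The arithmetic behind intervals of this kind can be sketched as follows. This assumes that the quoted value 2.2 is the t critical value and that each interval has the usual symmetric form, estimate ± t × standard error. The point estimates and standard errors are not given in the text, so the code only recovers the values implied by the rounded interval endpoints. The function names are hypothetical.

```python
# 95% confidence intervals as quoted in the text.
intervals = {
    "beta_0": (92.9, 164.7),
    "beta_1": (-186.8, -99.5),
    "beta_2": (48.7, 75.2),
}
T_CRIT = 2.2  # assumption: the quoted 2.2 is the t critical value


def summarize(lower, upper, t_crit):
    """Recover the midpoint, half-width and implied standard error."""
    estimate = (lower + upper) / 2
    half_width = (upper - lower) / 2
    std_err = half_width / t_crit
    return estimate, half_width, std_err


def confidence_interval(estimate, std_err, t_crit):
    """Symmetric interval: estimate plus or minus t_crit * std_err."""
    margin = t_crit * std_err
    return estimate - margin, estimate + margin


for name, (lower, upper) in intervals.items():
    estimate, half_width, std_err = summarize(lower, upper, T_CRIT)
    print(f"{name}: estimate ~ {estimate:.2f}, implied std. error ~ {std_err:.2f}")
```

Because the endpoints are rounded to one decimal place, the recovered values are approximate.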

Examining results of regression models

Checking model assumptions

The model assumptions are checked by calculating the residuals and plotting them. The following plots can be constructed to test the validity of the assumptions:
  • Residuals against the explanatory variables in the model, as illustrated above. The residuals should have no relation to these variables (look for possible non-linear relations) and the spread of the residuals should be the same over the whole range.
  • Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering them for inclusion in the model.
  • Residuals against the fitted values, $\hat{\mathbf{y}}$.
  • A time series plot of the residuals, that is, plotting the residuals as a function of time.
  • Residuals against the preceding residual.
  • A normal probability plot of the residuals to test normality. The points should lie along a straight line.

There shouldn't be any noticeable pattern to the data in any plot except the last.
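
A minimal sketch of the numbers behind some of these plots, for a straight-line fit. The data are made up, and the coefficients are assumed to be the least squares fit to that data; the function names are hypothetical. Note that a least squares fit with an intercept always produces residuals that sum to zero and have zero linear correlation with the explanatory variable, so the plots are needed to spot non-linear patterns or changes in spread. The correlation of each residual with the preceding one is a numeric companion to the last-but-one plot in the list.

```python
from math import sqrt


def residuals(xs, ys, intercept, slope):
    """Observed value minus fitted value for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    a_mean = sum(a) / n
    b_mean = sum(b) / n
    cov = sum((p - a_mean) * (q - b_mean) for p, q in zip(a, b))
    var_a = sum((p - a_mean) ** 2 for p in a)
    var_b = sum((q - b_mean) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_a * var_b)


# Illustrative data; intercept and slope are assumed to be its least squares fit.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
res = residuals(xs, ys, intercept=1.06, slope=1.97)

# Residuals against the explanatory variable: zero linear correlation by construction.
corr_with_x = correlation(res, xs)

# Residuals against the preceding residual (lag-1 pairs).
lag1_corr = correlation(res[:-1], res[1:])
print(f"lag-1 residual correlation = {lag1_corr:.3f}")
```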

Checking model structure

The structure of the model, in terms of whether all variables need to be included, can be checked using any of the following methods:
  • Using the confidence interval for each of the parameters, $\hat{\beta}_j$. If the confidence interval includes 0, then the parameter can be removed from the model. Ideally, a new regression analysis excluding that parameter would then be performed, and the process continued until there are no more parameters to remove. This is equivalent to using a t-test for each variable.
  • Computing F-statistics to check whether groups of explanatory variables can be removed.
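
Both checks can be sketched in a few lines. The intervals are the 95% intervals quoted earlier in the article. The F-statistic uses the standard comparison of a reduced model against the full model; the residual sums of squares passed to it in the example are invented purely to show the arithmetic, and the function names are hypothetical.

```python
def interval_contains_zero(lower, upper):
    """True if zero lies inside the closed interval [lower, upper]."""
    return lower <= 0.0 <= upper


def f_statistic(rss_reduced, rss_full, num_removed, df_full):
    """F-statistic for dropping `num_removed` parameters from the full model.

    rss_reduced, rss_full: residual sums of squares of the two fits.
    df_full: residual degrees of freedom of the full model
             (observations minus parameters).
    """
    return ((rss_reduced - rss_full) / num_removed) / (rss_full / df_full)


# 95% confidence intervals quoted in the text.
intervals = {
    "beta_0": (92.9, 164.7),
    "beta_1": (-186.8, -99.5),
    "beta_2": (48.7, 75.2),
}
removable = [name for name, (lo, hi) in intervals.items()
             if interval_contains_zero(lo, hi)]
print("candidates for removal:", removable)

# Invented residual sums of squares, for illustration only.
f_value = f_statistic(rss_reduced=120.0, rss_full=100.0, num_removed=2, df_full=20)
print("F =", f_value)
```

None of the quoted intervals contains zero, so on this criterion no parameter would be dropped. The F value would then be compared with the critical value of the F distribution for the corresponding degrees of freedom.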

Seeing how good the model is

  • When fitting a straight line, calculate the coefficient of determination. The closer the value is to 1, the better the regression is. This coefficient gives the fraction of the observed behaviour that can be explained by the given variables.
  • Examine the observational and prediction confidence intervals. The smaller they are, the better.
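
The coefficient of determination can be sketched as one minus the ratio of the residual sum of squares to the total sum of squares. The data and fitted values below are made up for illustration, and the function name `r_squared` is hypothetical.

```python
def r_squared(ys, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_mean = sum(ys) / len(ys)
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    if ss_total == 0:
        raise ValueError("R^2 is undefined when all y values are identical")
    return 1 - ss_residual / ss_total


# Illustrative data and the fitted values from an assumed line y = 1.06 + 1.97x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
fitted = [1.06 + 1.97 * x for x in xs]
r2 = r_squared(ys, fitted)
print(f"R^2 = {r2:.4f}")
```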

Other procedures

Weighted least squares

Weighted least squares is a generalisation of the least squares method, used when the observational errors have unequal variance.
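
A minimal sketch for the straight-line case, assuming each observation's error variance is known and using the common choice of weight = 1 / variance. The function name, data and variances are invented for illustration.

```python
def fit_line_weighted(xs, ys, weights):
    """Weighted least squares fit of y = intercept + slope * x."""
    if not (len(xs) == len(ys) == len(weights)):
        raise ValueError("xs, ys and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    w_sum = sum(weights)
    # Weighted means replace the ordinary means.
    x_mean = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Illustrative data; the last observation is assumed to be much noisier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
variances = [0.1, 0.1, 0.1, 0.1, 1.0]
weights = [1.0 / v for v in variances]
w_intercept, w_slope = fit_line_weighted(xs, ys, weights)
```

With equal weights the result reduces to the ordinary least squares fit.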

Errors-in-variables model

The errors-in-variables model, or total least squares, is used when the independent variable is subject to error.
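
One simple instance is orthogonal regression for a straight line, which minimizes perpendicular distances to the line rather than vertical ones. The sketch below assumes the errors in x and y have equal variance; that assumption, the function name and the data are not from the article.

```python
from math import sqrt


def fit_line_orthogonal(xs, ys):
    """Orthogonal (total least squares) fit of y = intercept + slope * x.

    Assumes equal error variance in x and y.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    if sxy == 0:
        raise ValueError("slope is not determined by this formula when sxy is zero")
    slope = (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Illustrative data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
tls_intercept, tls_slope = fit_line_orthogonal(xs, ys)
```

Unlike ordinary least squares, this fit treats the two variables symmetrically, so swapping x and y describes the same line.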

Generalized linear model

The generalized linear model is used when the distribution function of the errors isn't a normal distribution. Examples include the exponential, gamma, inverse Gaussian, Poisson, binomial and multinomial distributions.

Robust regression

A host of alternative approaches to the computation of regression parameters are included in the category known as robust regression. One technique minimizes the mean absolute error, or some other function of the residuals, instead of the mean squared error as in linear regression. Robust regression is much more computationally intensive than linear regression and is somewhat more difficult to implement as well. While least squares estimates are not very sensitive to violations of the assumption that the errors are normally distributed, this isn't true when the variance or mean of the error distribution isn't bounded, or when an analyst who can identify outliers is unavailable.
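
As a sketch of the absolute-error technique for a straight line: a line minimizing the sum of absolute residuals can always be chosen to pass through two of the data points, so a brute-force search over pairs of points finds one. This is written for clarity, not speed (the cost grows with the cube of the number of points), and it is only one of many robust methods. The function names and data are invented for illustration.

```python
from itertools import combinations


def sum_abs_residuals(xs, ys, intercept, slope):
    """Sum of absolute vertical distances from the points to the line."""
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))


def fit_line_lad(xs, ys):
    """Least absolute deviations fit by searching lines through pairs of points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a + b*x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum_abs_residuals(xs, ys, intercept, slope)
        if best is None or cost < best[0]:
            best = (cost, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]


# Points on y = 1 + 2x, except for one gross outlier at x = 3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 27.0, 9.0, 11.0]
lad_intercept, lad_slope = fit_line_lad(xs, ys)
print(f"intercept = {lad_intercept}, slope = {lad_slope}")
```

On this data the absolute-error fit recovers the line through the five well-behaved points, whereas a least squares fit would be pulled towards the outlier.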

Among Stata users, robust regression is frequently taken to mean linear regression with Huber-White standard error estimates, owing to the naming conventions for regression commands. This procedure relaxes the assumption of homoscedasticity for the variance estimates only; the coefficient estimates are still ordinary least squares (OLS) estimates. This occasionally leads to confusion: Stata users sometimes believe that linear regression is a robust method when this option is used, although it is not robust in the sense of outlier resistance.

Applications of linear regression

Linear regression is widely used in biological, behavioral and social sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines.

The trend line

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over the period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.

Trend lines are sometimes used in business analytics to show changes in data over time. They are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. The technique has the advantage of being simple: it doesn't require a control group, an experimental design, or a sophisticated analysis. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

Medicine

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality isn't due to some effect of education or income. However, it's never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are considered to be more trustworthy than a regression analysis.

Finance

The capital asset pricing model uses linear regression, as well as the concept of beta, for analyzing and quantifying the systematic risk of an investment. Beta comes directly from the slope coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.

Regression may not be the appropriate way to estimate beta in finance, given that beta is supposed to provide the volatility of an investment relative to the volatility of the market as a whole. This would require that both variables be treated in the same way when estimating the slope. Regression, however, treats all variability as being in the investment returns variable; that is, it considers only residuals in the dependent variable.
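
The distinction can be sketched numerically. The regression slope equals the covariance of the two return series divided by the variance of the market returns, whereas a plain ratio of volatilities ignores how closely the two series move together; the slope equals the correlation multiplied by that ratio. The return series below are invented, raw returns are used instead of excess returns for simplicity, and the function names are hypothetical.

```python
from math import sqrt


def regression_beta(asset_returns, market_returns):
    """Slope of the regression of asset returns on market returns."""
    n = len(asset_returns)
    a_mean = sum(asset_returns) / n
    m_mean = sum(market_returns) / n
    cov = sum((a - a_mean) * (m - m_mean)
              for a, m in zip(asset_returns, market_returns))
    var_m = sum((m - m_mean) ** 2 for m in market_returns)
    if var_m == 0:
        raise ValueError("market returns are constant")
    return cov / var_m


def volatility_ratio(asset_returns, market_returns):
    """Standard deviation of asset returns over that of market returns."""
    n = len(asset_returns)
    a_mean = sum(asset_returns) / n
    m_mean = sum(market_returns) / n
    var_a = sum((a - a_mean) ** 2 for a in asset_returns)
    var_m = sum((m - m_mean) ** 2 for m in market_returns)
    if var_m == 0:
        raise ValueError("market returns are constant")
    return sqrt(var_a / var_m)


# Invented period returns for the market and for one asset.
market = [0.01, -0.02, 0.03, 0.00, 0.02]
asset = [0.03, -0.01, 0.02, -0.02, 0.04]
beta = regression_beta(asset, market)
ratio = volatility_ratio(asset, market)
print(f"regression beta = {beta:.3f}, volatility ratio = {ratio:.3f}")
```

The two numbers coincide only when the asset's returns are perfectly positively correlated with the market's; otherwise the regression slope is smaller in magnitude than the volatility ratio.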

Copyright © 2007-8 totallyexplained.com | Licensed under the GNU Free Documentation License
This article contains text from the Wikipedia article Linear regression and is released under the GFDL.